Gunther et al. BMC Bioinformatics 201 1, 12:265 
http://www.biomedcentral.eom/1 471 -21 05/1 2/265 



Bioinformatics 



SOFTWARE Open Access 



phenosim - A software to simulate phenotypes 
for testing in genome-wide association studies 

Torsten Gunther", Inka Gawenda and Karl J Schmid 
Abstract 

Background: There is a great interest in understanding the genetic architecture of complex traits in natural 
populations. Genome-wide association studies (GWAS) are becoming routine in human, animal and plant genetics 
to understand the connection between naturally occurring genotypic and phenotypic variation. Coalescent 
simulations are commonly used in population genetics to simulate genotypes under different parameters and 
demographic models. 

Results: Here, we present phenosim, a software to add a phenotype to genotypes generated in time-efficient 
coalescent simulations. Both qualitative and quantitative phenotypes can be generated and it is possible to 
partition phenotypic variation between additive effects and epistatic interactions between causal variants. The 
output formats of phenosim are directly usable as input for different GWAS tools. The applicability of phenosim 
is shown by simulating a genome-wide association study in Arabidopsis tholiono. 

Conclusions: By using the coalescent approach to generate genotypes and phenosim to add phenotypes, the 
data sets can be used to assess the influence of various factors such as demography, genetic architecture or 
selection on the statistical power of association methods to detect causal genetic variants under a wide variety of 
population genetic scenarios, phenosim is freely available from the authors' website http://evoplant.uni- 
hohenheim.de 



Background 

In recent years, genome-wide association studies 
(GWAS) became widely used to uncover the genetic 
basis of complex traits by comparing patterns of genetic 
and phenotypic variation [1-3]. The power of such stu- 
dies depends on various factors that include the genetic 
architecture of the trait, the demographic history of the 
population, and variation in mutation and recombina- 
tion rates [4]. In addition, the trait under investigation 
may be adaptive or (in case of a disease trait) can evolve 
under purifying selection, which both would result in a 
non-neutral pattern of genetic diversity in the genomic 
neighborhood of the causal mutation. 

Coalescent simulations are widely used to simulate 
genotypes under complex demographies [5] with recent 
extensions to include recombination hotspots [6] and 
selection [7], or to simulate whole genomes [8]. Simula- 
tions are often used to test population genetic 
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hypotheses by comparing simulated and observed data. 
However, such simulations produce only genotypes but 
not phenotypes, which are also required to test methods 
for detecting significant associations between genetic 
and phenotypic variation. Although some tools provide 
an option to map phenotypes onto simulated genotypes, 
they only allow the simulation of qualitative phenotypes 
[9] or require time-consuming forward-in-time simula- 
tions to create genotypes from complex demographic 
scenarios [10-13]. 

Here, we present phenosim, a tool written in Python 
[14] that was designed to add a phenotype to genotypes 
simulated by coalescent-based simulation tools. Simu- 
lated phenotypes may either be qualitative or quantita- 
tive traits with different effect sizes and may show 
epistatic interactions. Hence, the simulation of case/con- 
trol studies as well as the search for quantitative trait 
nucleotides (QTNs) of a complex trait with a user- 
defined architecture is possible. By combining simulated 
genotypes and phenotypes, researchers can assess the 
influence of different factors on the power of new 
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methods for association mapping, compare different 
methods or estimate an optimal sample size and number 
of markers for a given study design. 

Implementation 

The general work flow of phenosim is shown in Figure 1. 
First, the user simulates genotypes with one of four differ- 
ent programs for coalescent simulations. In the current 
version, phenosim is able to read the output of the ms 
[5], msH0T[6], msms[7] and GENOME [8] programs. After 
the import of genotypes, a phenotype generated under a 
user-defined model is assigned to each genotype. The trait 
can either be qualitative or quantitative. 

For qualitative traits, one- and two-locus models are 
supported. The user defines the model by setting the 
penetrance (probability of being affected) for all geno- 
types. In the two-locus model, this is done by a pene- 
trance table for all possible allelic combinations among 
the two loci. Therefore, the user may define arbitrary 
interactions between all alleles of the loci. The case/con- 
trol-status of all simulated individuals is then assigned 
according to the model. In many cases, disease states 



are caused by risk alleles segregating at low allele fre- 
quencies in the overall population. As such low fre- 
quency variants share a genealogy that may differ from 
high frequency variants and thus the linkage pattern 
around these variants may be different [15], the user 
can restrict causal mutations to a certain frequency 
range to obtain realistic risk loci. However, as this may 
result in a low number of cases in the final sample, 
users need to simulate larger populations and optionally 
enter a minimum number of cases to be sampled from 
the population. This procedure reflects the sampling 
procedure of many case/control studies. 

For quantitative traits, multiple QTNs with additive 
effects or epistatic interactions between two QTNs are 
possible. By default locations of causal variants are 
selected randomly. Nevertheless, the user can determine 
the position of a QTN manually and/or restrict the 
selection to an allele frequency range. A phenotype is 
generated based on the formulas of [16], which we gen- 
eralize for additive effects among multiple QTNs as fol- 
lows. The trait value is calculated by adding a fixed 
variance proportion explained by the QTN to a random 



Coalescent Simulation^ Phenotype Assignment! Subsamplingj Output/Analysis| 



one locus model 
case/control 



1- 



random I 
ositionl 



GENOMEl 

Liang et al. 2007 



H 



ms 

iHudson 2002 



msHOT L 

Hellenthal and Stephens,2007| 



msms 

Eving and Hermisson,20IO| 



H 



- ^Qualitative traitfr 



case/control 




without 




random 


epistatic 






position | 


effect 









^ epistatic I 
effect 



subsampling 
—H- No. of genotypes 
No. of markers 




located on 
haploblock 



4-gamete-testl 

Hudson and Kaplan, I985| 



random I 
position! 



1 



removal of 
causal marker 



EMMA/EMMAXl 

Kang et al. 2008, 2010 



] 



Blossoc/QBIossocI 



TASSEL I 

Bradbury et al., 2007| 



PLINK 

|Purcel et al., 2007| 



QTDT \ 

Abecasis et al., 200o| 



MERLIN I 

Abecasis et al„ 2002| 



phenosim 



Figure 1 Work flow. Flowchart of the phenosim pipeline. 
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number drawn from a standard normal distribution with 
mean 0 and standard deviation 1, N{0, 1). We provide 
two different models, depending on the ploidy of the 
individuals. The effect of the y-th QTN is ttj and the 
QTN has a derived allele frequency of fj. It should be 
noted that the sum of all QTN effects, Z ; Tip equals the 
heritability, /z 2 , of the trait. If the individuals are haploid, 
the allelic state of the z-th individual at the ;-th QTN is 
dip where := 0 if the allele is ancestral and := 1 if 
it is derived. Then the phenotype Y of individual i is cal- 
culated as: 



The phenotype of diploid individuals under an addi- 
tive model without dominance is calculated as: 



Y d (i)= /l-^xMO,!)^^^. (2) 

where := 1 if the y-th QTN is homozygous derived, 
Qij := 0 if the QTN is heterozygous and Q, ; := -1 if the 
QTN is homozygous ancestral. Dominant effects at each 
QTN and additive effects between loci are also sup- 
ported for diploids. In this case, equation (1) is used 
with a t j := 0 for homozygous ancestral QTNs and := 
1 for heterozygous and homozygous derived individuals. 

If exactly two QTNs are selected, a positive, additive 
epistatic effect n E between these QTNs can be simu- 
lated. This epistasis is modeled as a fictive third QTN, 
whose allelic state a iE is 1, if the individual carries at 
least one derived allele at both basal QTNs. For users 
with a some Python scripting experience, other types of 
epistasis can easily be simulated by modifying the code 
of phenosim. To simulate a causal haplotype or allelic 
heterogeneity among two causal variants within a single 
gene, both QTNs may also be located on a common 
haploblock defined by the four-gamete test [17]. 

To our knowledge, quant iNemo[ 12] is the only soft- 
ware that currently supports the simulation of interac- 
tions between QTNs. However, quant iNemo utilizes 
time-consuming forward simulations, whereas pheno- 
sim allows to include epistasis between QTNs within a 
time-efficient coalescent framework. 

After phenotypes have been generated, a predefined 
number of markers and/or individuals can be sub 
sampled from the total simulated population. The causal 
marker(s) can be optionally removed from the sample, 
since frequently the causal mutation itself is not geno- 
typed in a genome-wide study. Finally, genotypes and 
phenotypes are written into different output file formats 
that can be directly used as input for commonly used 
association programs such as Blossoc/QBlossoc 



[16,18], EMMA/EMMAX[19,20], PLINK[21], QTDT/MER- 
LIN[22,23] and TASSEL 3.0 [24]. A snapshot of phe- 
nosim is available as Additional File 1 whereas the 
most current version is maintained at http://evoplant. 
uni-hohenheim.de 

Results and Discussion 

To demonstrate the ability of phenosim to simulate 
data for GWAS, we utilized GENOME [8] and simulated 
populations N e = 1000, with a population recombination 
parameter of p = 8 • 10" 3 and 250,000 SNPs distributed 
over a 120 Mbp genome. These settings are comparable 
to data sets used for recent GWAS in A. thaliana 
[2,25,26]. phenosim was then used to generate pheno- 
types under three different models: (i) 2 QTNs, each 
with an effect of 0.05; (ii) 2 QTNs at random positions, 
each with an effect of 0.01, and epistatic interaction of 
ji e = 0.08; and (iii) 2 QTNs, located on a common hap- 
loblock, each with an effect of 0.01 and epistatic interac- 
tion of tt e = 0.08. In all three scenarios, the total 
proportion of variance explained by these QTNs and 
their interaction was identical (h 2 = 0.1). Four hundred 
chromosomes were sub-sampled and the causal poly- 
morphisms were removed from the data. EMMAX[20] 
was used to detect marker-trait associations and the 
causal locus for this hypothetical trait. In Figure 2 we 
show the proportion of significant markers that were 
found at a given distance from the causal locus. In the 
first model (only additive effects), less than 10% of the 
detected significant markers are located within a dis- 
tance of 10 kbp to the causal locus. A larger sample size 
may increase the power to detect such small additive 
effects in genome-wide scans. Despite the smaller addi- 
tive effect in model (ii), the number of significant mar- 
kers within 10 kbp of the QTN was comparable to 
model (i). Additionally, there is an increased number of 
significant associations further than 10 kbp from the 
QTNs. These may represent false positive associations 
caused by epistasis, such as markers that are in strong 
linkage disequilibrium with the fictive epistatic marker 
[27]. The highest power was observed in the third 
model. QTNs on a common haploblock with epistatic 
effects create a strong joint QTL and therefore in more 
than 75% of simulations, a significant marker was 
located within a distance of 10 kbp to the causal locus. 
The results show that single marker association methods 
as EMMAX are able to detect QTNs with small additive 
effects and a strong positive epistatic interaction. How- 
ever, in certain situations larger samples than simulated 
sizes are needed and some results may be confounded 
by false positives as discussed earlier [27]. 

On average, a single simulation ran 4 min with GENOME 
[8] and 2 min with phenosim on a single core of an Intel 
Xeon X5650 (2.66 GHz) Processor. To compare this 
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Figure 2 Example. Proportion of significant marker-trait 
associations (Bonferroni adjusted significance threshold) found at 
different distances from the causal marker. For each QTN, the 
distance to the next significant association is shown. The bars on 
the left show the proportion of simulated data sets for which no 
significant association was found. The simulated models include: 
'additive' - 2 QTNs randomly distributed with a variance proportion 
of 0.05 each; 'epLrand' - 2 QTNs randomly distributed with a 
variance proportion of 0.01 each and 0.08 epistatic effect; 'epLhapl' 
- 2 QTNs located on a common haploblock with a variance 
proportion of 0.01 each and 0.08 epistatic effect. Each model was 
simulated 1000 times. 



efficiently. Additionally, as different causal markers may 
contribute different effects to a trait, the essential sam- 
ple size and number of markers to detect a certain pat- 
tern can be estimated. 

Availability and requirements 

♦ Project name: phenosim 

♦ Project home page: http://evoplant.uni-hohen- 
heim.de 

♦ Operating system(s): Platform independent 

♦ Programming language: Python 

♦ Other requirements: Python 2.X 

♦ License: no license required 

♦ Any restrictions to use by non-academics: none 
Additional material 

r >> 

Additional file 1: phenosim v0.15. The archive includes the current 
version of phenosim as well as a documentation of its usage. For 
updated versions, please visit the authors' website http://evoplant.uni- 
hohenheim.de. 
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running time with other software tools, we simulated two 
QTNs and 249,998 neutral loci in a population of 500 
diploid individuals using quant iNemo[ 12]. In six min- 
utes, quant iNemo generated -120 generations. As the 
expected coalescent time for a sample is ~ 4N e generations 
[28], this is by far not enough to get a realistic variation 
pattern comparable to what can be achieved by GENOME 
in the same time. Although forward simulations like 
quant iNemo allow more complex demographic, selec- 
tion and trait scenarios, the combination of coalescent 
simulators and phenosim is much more suitable for gen- 
erating multiple simulations of large sample sizes. 

Conclusions 

Demographic effects, genetic architecture, selection, and 
different mutation and recombination rates affect the 
ability to detect the genetic basis of complex traits in 
natural populations [4]. Such population genetic para- 
meters can now be estimated from genome-wide marker 
sets prior to further analyses. Since GWAS are widely 
used in plant and animal genetics, there is a great inter- 
est in assessing the power of a particular study or 
method. Using coalescent simulations in conjunction 
with phenosim, one can investigate the statistical 
power and other characteristics of GWAS methods 
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